Abstract

Modeling sequence evolution on phylogenetic trees is a useful technique in computational biology. Especially
powerful are models which take account of the heterogeneous nature of sequence evolution according
to the "grammar" of the encoded gene features. However, beyond a modest level of model complexity,
manual coding of models becomes prohibitively labor-intensive.

We demonstrate, via a set of case studies, the new built-in model-prototyping capabilities of XRate 
(macros and Scheme extensions). These features allow rapid implementation of phylogenetic models 
which would have previously been far more labor-intensive. XRate's new capabilities for lineage-specific 
models, ancestral sequence reconstruction, and improved annotation output are also discussed. 
XRate's flexible model-specification capabilities and computational efficiency make it well-suited to
developing and prototyping phylogenetic grammar models. XRate is available as part of the DART software
package: http://biowiki.org/DART.



Introduction 

Phylogenetics, the modeling of evolution on trees, is an extremely powerful tool in computational biology. 
The better we can model a system, the more we can learn from it, and vice-versa. Especially attractive, given
the plethora of available sequence data, is modeling sequence evolution at the molecular level. Models
describing the evolution of a single nucleotide began simply (e.g. JC69 [1]), later evolving to capture
such biological features as transition/transversion bias (e.g. K80 [2]) and unequal base frequencies (e.g.
HKY85 [3]). Felsenstein's "pruning" algorithm allows combining these models with phylogenetic trees to
compute the likelihood of multiple sequences [4].
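
As a concrete illustration of the pruning recursion mentioned above, the following is a minimal Python sketch (Python is used only for illustration and is not part of XRate). It assumes the JC69 model with its closed-form transition probabilities and a uniform root distribution; the tree encoding (nested lists of `(child, branch_length)` pairs) and the function names are hypothetical choices made for this example.

```python
import math

BASES = "ACGT"

def jc69_prob(t, u=1.0 / 3.0):
    """Closed-form JC69 transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * u * t)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial(node):
    """Return L[x] = P(leaves observed below node | node carries base x).

    A node is either an observed leaf base (a str) or a list of
    (child, branch_length) pairs (hypothetical encoding for this sketch).
    """
    if isinstance(node, str):
        return [1.0 if b == node else 0.0 for b in BASES]
    like = [1.0] * 4
    for child, t in node:
        p = jc69_prob(t)
        child_like = partial(child)
        for x in range(4):
            like[x] *= sum(p[x][y] * child_like[y] for y in range(4))
    return like

def column_likelihood(tree):
    """Likelihood of one alignment column, with a uniform root distribution."""
    return sum(0.25 * lx for lx in partial(tree))

# One alignment column on a three-leaf tree: ((A:0.1, A:0.1):0.2, C:0.3)
tree = [([("A", 0.1), ("A", 0.1)], 0.2), ("C", 0.3)]
print(column_likelihood(tree))
```

Each internal node multiplies together, over its children, the probability of everything observed below that child; this is what lets the likelihood be computed in time linear in the number of branches.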

As powerful as phylogenetic models are for explaining the evolutionary depth of a sequence alignment,
they are even more powerful when combined with a model for the feature structure: the partition of the
alignment into regions, each evolving under a particular model. The phylogenetic grammar, or
"phylo-grammar", is one such class of models. Combining hidden Markov models (and, more generally, stochastic
grammars) and phylogenetic substitution models provides computational modelers with a rich set of
comparative tools to analyze multiple sequence alignments (MSAs): gene prediction, homology detection,
finding structured RNA, and detecting changes in selective pressure have all been approached with this
general framework [5-8]. Readers unfamiliar with phylo-grammars may benefit from relevant descriptions
and links available here: http://biowiki.org/PhyloGrammars or the original paper describing XRate
[9]. Also, a collection of animations depicting various evolutionary models at work (generating multiple
alignments or evolving sequences) has been compiled here: http://biowiki.org/PhyloFilm



While the mathematics of sequence modeling is straightforward, manual implementation can quickly 
become the limiting factor in iterative development of a computational pipeline. To streamline this step, 
general modeling platforms have been developed. For instance, Exonerate allows users to specify a wide
variety of common substitution and gap models when aligning pairs of sequences [10]. Dynamite uses a
specification file to generate code for dynamic programming routines [11]. HMMoC is a similar model
compiler sufficiently general to work with arbitrary HMMs [12]. The BEAST program allows users to
choose from a wide range of phylogenetic substitution models while also sampling over trees [13]. The
first three of these are non-phylogenetic, only able to model related pairs of sequences. Dynamite and 
HMMoC are unique in that they allow definition of arbitrary models via specification files, whereas users 
of BEAST and Exonerate are limited to the range of models which have been hard-coded in the respective 
programs. 

Defining models' structure manually can be limiting as models grow in size and/or complexity. For 
instance, a Nielsen-Yang model incorporating both selection and transition/transversion bias has nearly
4000 entries, far too many for a user to specify manually [14]. Such a large matrix requires specific
model-generating code to be written and integrated with the program in use - not always possible or 
practical for the user depending on the program's implementation. 

XRate is a phylogenetic modeling program that implements the key parameterization and inference 
algorithms given two ingredients: a user-specified phylo-grammar, and a multiple sequence alignment. (A 
phylogeny can optionally be specified by the user, or it can be inferred by the program.) XRate's models
describe the parametric structure of substitution rate matrices, along with grammatical rules governing
which rate matrices can account for which alignment columns. This essentially amounts to partitioning 
the alignment (e.g. marking up exon boundaries and reading frames) and factoring in the transitions 
between the different types of region. 

Parameter estimation and decoding (alignment annotation) algorithms are built in, allowing fast 
model prototyping and fitting. Model training (estimating the rate and probability parameters of the 
grammar) is done via a form of the Expectation Maximization (EM) algorithm, described in more detail 
in the original XRate paper [9]. Most recently, XRate allows programmatic model construction via its
macros and Scheme extensions. XRate's built-in macro language allows large, repetitive grammars to be 
compactly represented, and also enables the model structure to depend on aspects of the data, such as the 
tree or alignment. Scheme extensions take this even further, interfacing XRate to a full-featured functional 
scripting language, allowing complex XRate-oriented workflows to be written as Scheme programs. 

In this paper we demonstrate XRate's new model-specification tools via a set of progressively more 
complex examples, concluding with XDecoder, a phylo-grammar modeling RNA secondary structure 
overlapping protein-coding regions. We also describe additional improvements to XRate since its initial 
publication, namely ancestral sequence reconstruction, GFF/WIG output, and hybrid substitution models.
Finally, we show how XRate's features are exposed as function extensions in a dialect of the Scheme
programming language, typifying a Functional Programming (FP) style of model development and inference
for phylogenetic sequence analysis. Terminology relevant to modeling with XRate is defined in
detail in Appendix Section A. We also provide an online tutorial for making nontrivial modifications to existing
grammars, going step-by-step from a Jukes-Cantor model to an autocorrelated Gamma-distributed
rates phylo-HMM: http://biowiki.org/XrateTutorial



Results and Discussion 

The XRate generative model 

A phylo-grammar generates an alignment in two steps: nonterminal transformations and token evolution. 
The sequence of nonterminal transformations comprises the "grammar" portion of a phylo-grammar, and 
the "phylo" portion refers to the evolution of tokens along a phylogeny. First, transformation rules are 
repeatedly applied, beginning with the START nonterminal, until only a series of pseudoterminals remains.
From each group of pseudoterminals (a group may be a single column, two "paired" columns in an RNA
structure, or a codon triplet of columns), a tuple of tokens is sampled from the initial distribution of 
the chain corresponding to the pseudoterminal. These tokens then evolve down the phylogenetic tree 
according to the mutation rules of the chain, resulting in the observed alignment columns. 
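
The "phylo" half of this generative process (sampling a token from the initial distribution, then evolving it down the tree) can be sketched in a few lines of Python. This is an illustrative sketch only, not XRate's or simgram's implementation: it assumes a single-column pseudoterminal, a JC69 chain with a uniform initial distribution, and a hypothetical tree encoding of nested `(child, branch_length)` lists with leaf names as strings.

```python
import math
import random

BASES = "ACGT"

def jc69_prob(t, u=1.0 / 3.0):
    """Closed-form JC69 transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * u * t)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def evolve(base, t, rng):
    """Sample the token at the bottom of a branch given the token at the top."""
    row = jc69_prob(t)[BASES.index(base)]
    return rng.choices(BASES, weights=row)[0]

def simulate_column(tree, rng, base=None):
    """Generate one alignment column as a dict mapping leaf name to token."""
    if base is None:
        # Root token: drawn from the chain's initial distribution
        # (uniform for JC69).
        base = rng.choice(BASES)
    if isinstance(tree, str):
        return {tree: base}
    column = {}
    for child, t in tree:
        column.update(simulate_column(child, rng, evolve(base, t, rng)))
    return column

# Hypothetical three-taxon tree: ((human:0.1, chimp:0.1):0.2, mouse:0.4)
TREE = [([("human", 0.1), ("chimp", 0.1)], 0.2), ("mouse", 0.4)]
print(simulate_column(TREE, random.Random(0)))
```

In a full phylo-grammar, the sequence of pseudoterminals produced by the nonterminal transformations would determine which chain is used for each column (or column group); here a single chain is applied to one column.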

If the nonterminal transformations contain no bifurcations and all emissions occur on the same side of 
the nonterminal, the grammar is a phylogenetic hidden Markov model (phylo-HMM), a special subclass 
of phylo-grammars. Otherwise, it is a phylogenetic stochastic context-free grammar (phylo-SCFG), the 
most general class of models implemented by XRate. This distinction, along with other related technical
terms, is described in greater detail in Appendix Section A, the Glossary of XRate model terminology.

The generality of XRate requires a slight tradeoff against speed. Since the low-level code implementing 
core operations is shared among the set of possible models, XRate will generally be slower than programs 
with source code optimized for a narrower range of models. Computing the Felsenstein likelihood under 
the HKY85 [3] model of a 5-taxon, 1 Mb alignment, XRate required 1.25 minutes of CPU time and 116 MB
RAM, while PAML required 9 seconds of CPU time and 19 MB RAM for the same operation. Running
PFOLD [15] on a 5-taxon, 1 kb alignment required 11 seconds and 164 MB RAM, and running XRate on
the same alignment with a comparable grammar required 25 seconds and 62 MB RAM. All programs were
run with default settings on a 3.4 GHz Intel i7 processor. Model-fitting also takes longer with XRate: a
previous work found that XRate's parameter estimation routines were approximately 130 times slower
than those in PAML [16].



In an attempt to improve XRate's performance, we tried using Beagle, a library that provides 
CPU and accelerated parallel GPU implementations of Felsenstein's algorithm along with related matrix 
operations [17]. We have, however, been so far unable to generate significant performance gains by this
method. 

Despite these caveats, XRate has proved to be fast enough for genome-scale applications, such as a 
screen of Drosophila whole-genome alignments [18]. Furthermore, it implements a significantly broader
range of models than the above-cited tools. 

XRate inputs, outputs and operations 

The formulation of the XRate model presented in the previous section is generative: that is, it describes 
the generation of data on a tree. In practice, the main reason for doing this is to generate simulation
data for benchmarking purposes. This is possible using the tool simgram [19], which is provided with
XRate as part of the DART package. 

Most common use cases for generative models involve not simulation, but inference: that is,
reconstructing aspects of the generative process (sequence of nonterminal transformations, token mutations,
or grammar parameters) given observed sequence data (in the form of a multiple sequence alignment). 
Using a phylo-grammar, a set of aligned sequences, and a phylogeny relating these sequences (optionally 
inferred by XRate), XRate implements the relevant parameterization and inference algorithms, allowing 
researchers to analyze sequence data without having to implement their own models. 

Sequences are read and written in Stockholm format [20] (converters to and from common formats
are included with DART). This format allows for the option of embedding a tree in Newick format [21]
(via the #=GF NH tag) and annotations in GFF format [22]. By construction, Newick format necessarily
specifies a rooted tree, rather than an unrooted one. However, the root placement is only relevant for
time-irreversible models; when using time-reversible models, the placement of the root is arbitrary and
can safely be ignored. Given these input ingredients, a call to XRate proceeds in the following order
(more detail is provided at http://biowiki.org/XRATE and http://biowiki.org/XrateFormat):



1. The Stockholm file and grammar alphabet are parsed (as macros may depend on these). 

2. Any grammar macros are expanded, followed by Scheme functions. 

3. If requested, or a tree was not provided in the input data, one is estimated using neighbor-joining
[23]. As noted above, this is a rooted tree, but the root placement is arbitrary if a time-reversible
model is used.

4. Grammar parameters are estimated (if requested).

5. Alignment is annotated (if requested).

6. Ancestral sequences are reconstructed (if requested). 

After the analysis is complete, the alignment (along with an embedded tree) is printed to the output 
stream along with ancestral sequences (if requested) as well as any #=GC and #=GR column annotations. 
GFF and WIG annotations are sent to standard output by default, but these can be directed to separate 
files by way of the -gff and -wig options, respectively.
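
To make the input format described above more concrete, here is a minimal Python sketch of a Stockholm reader that pulls out the aligned sequences, the Newick tree embedded via `#=GF NH`, and any `#=GC` per-column annotations. It is an illustrative sketch, not DART's parser: it ignores `#=GS` and `#=GR` lines, and the sequence names, tree, and the `ANNOT` tag in the example are hypothetical.

```python
def parse_stockholm(text):
    """Minimal Stockholm reader.

    Returns (sequences, newick, column_annotations). Only sequence rows,
    '#=GF NH' lines and '#=GC' lines are interpreted; other markup is skipped.
    """
    seqs, gc, newick = {}, {}, []
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":  # end-of-alignment marker
            break
        if line.startswith("#=GF NH"):
            newick.append(line[len("#=GF NH"):].strip())
        elif line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            gc[tag] = gc.get(tag, "") + value
        elif line.startswith("#"):
            continue  # other annotation types are ignored in this sketch
        else:
            name, row = line.split(None, 1)
            seqs[name] = seqs.get(name, "") + row.strip()
    return seqs, "".join(newick), gc

# Hypothetical three-sequence alignment with an embedded tree.
EXAMPLE = """# STOCKHOLM 1.0
#=GF NH ((human:0.1,chimp:0.1):0.2,mouse:0.4);
human  ACGTAC
chimp  ACGTAC
mouse  ACGAAC
#=GC ANNOT 112211
//
"""

sequences, tree, annotations = parse_stockholm(EXAMPLE)
print(sequences)
print(tree)
print(annotations)
```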
The XRate format macro language for phylo-grammar specification: case studies

The following sections describe case studies of repetitively-structured models which motivate the need for 
grammar-generating code. Historically, we have attempted several solutions to the case studies described. 
We first briefly review the factors that influenced our eventual choice of Scheme as a macro language. 
XRate was preceded by Searls' Prolog-based automata [24] and Birney's Dynamite parser-generator
[11], and is roughly contemporaneous with Slater's Exonerate [10] and Lunter's HMMoC [12]. In early
versions of XRate (circa 2004), and in Exonerate, the only way for the user to specify their own
phylo-grammar models was to write C/C++ code that would compile directly against the program's internal
libraries. This kind of compilation step significantly slows model prototyping, and impedes re-use of
model parameters.

Current versions of XRate, along with Dynamite and HMMoC, understand a machine-readable grammar
format. In the case of XRate, this format is based on Lisp S-expressions. In such formats (as the case
studies illustrate) the need arises for code that generates repetitively-structured grammar files. It is often
convenient, and sometimes sufficient, to write such grammar-generating code in an external language:
for example, we have written Perl, Python and C++ libraries to generate XRate grammar files [9][16].
However, this approach still has the disadvantage (from a programmer's or model developer's perspective)
that (a) code to generate real grammars tends to require an ungainly mix of grammar-related S-expression
constants embedded in Perl/Python/C++ code, and (b) the requirement for an explicit model-generation
step can delay prototyping and evaluation of new phylo-grammar models.

XRate's macro language provides an alternate way to generate repetitive models within XRate, without
having to resort to external code-generating scripts. This allows the model-specifying code to remain
compact, readable, and easy to edit. As we report in this manuscript, the XRate grammar format now
also natively includes a Scheme-based scripting language that can be embedded directly within grammar
files, whose syntax blends seamlessly with the S-expression format used by XRate and whose functional
nature fits XRate's problem domain. We provide here examples of common phylogenetic models which
make use of various macro features, and refer the reader to the online documentation for a complete
introduction to XRate's macro features: http://biowiki.org/XrateMacros. All of the code snippets
presented here are available as minimal complete grammars in Text S1. The full, trained grammars
corresponding to those presented here are available as part of DART. This correspondence is described
here: http://biowiki.org/XratePaper2011



A repetitively-structured HMM specified using simple macros 

Probabilistic models for the evolution of biological sequences tend to contain repetitive structure.
Sometimes, this structure arises as a reflection of symmetries in the phylo-grammar; other times, it arises due
to structure in the data, such as the tree or the alignment. While small repetitive models can be written 
manually, developing richer evolutionary models and grammars often demands writing code to model the 
underlying structure. 

Markov chain symmetry The most familiar source of repetition derives from the substitution model's 
structure: different substitutions share parameters based on prior knowledge or biological intuition. 
Perhaps most repetitive is the Jukes-Cantor model for DNA. The matrix entries Q_ij denote the rate of
substitution from i to j, with rows and columns indexed by the tokens A, C, G, T:

$$Q_{JC} = \begin{pmatrix} * & u & u & u \\ u & * & u & u \\ u & u & * & u \\ u & u & u & * \end{pmatrix}$$

Here u is an arbitrary positive rate parameter. The * character denotes the negative sum of the
remaining row entries (here equal to -3u in every case). The parameter u is typically set to 1/3 in order
that the stochastic process performs, on average, one substitution event per unit of time. 
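
The normalization claim above can be checked with a short calculation. The Python sketch below (illustrative only; the function names are hypothetical) builds the Jukes-Cantor rate matrix with two nested loops over tokens, fills in each diagonal entry as the negative sum of the rest of its row, and computes the expected substitution rate at equilibrium, which is the sum over tokens x of pi_x times -Q_xx.

```python
TOKENS = "ACGT"

def jc69_rate_matrix(u):
    """Jukes-Cantor rate matrix as a dict keyed by (source, destination)."""
    q = {}
    for src in TOKENS:
        for dst in TOKENS:
            if src != dst:
                q[src, dst] = u
        # Diagonal entry: negative sum of the remaining row entries.
        q[src, src] = -sum(q[src, dst] for dst in TOKENS if dst != src)
    return q

def expected_rate(q, freqs):
    """Expected number of substitutions per unit time at equilibrium."""
    return -sum(freqs[x] * q[x, x] for x in TOKENS)

q = jc69_rate_matrix(1.0 / 3.0)
uniform = {x: 0.25 for x in TOKENS}
print(q["A", "A"])                # diagonal entry, i.e. -3u
print(expected_rate(q, uniform))  # one substitution per unit time when u = 1/3
```

With u = 1/3 each row's total leaving rate is 3u = 1, so the process makes on average one substitution per unit of time regardless of the current token.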

This matrix can be specified in XRate with two nested loops over alphabet tokens. Each loop over
alphabet tokens has the form (&foreach-token X expression...) where expression... is a construct
to be expanded for each alphabet token X. Here, expression sets the substitution rate between each pair
of source and destination tokens (except for the case when the source and destination tokens are identical,
for which case we simply generate an empty list, (), which will be ignored by the XRate grammar parser).
We do not explicitly need to write the negative values of the on-diagonal matrix elements (labeled * in
the above description of the matrix); XRate will figure these out for itself. To check whether source
and destination tokens are equal in the loop, we use a conditional &if statement, which has the form
(&if (condition) (expansion-if-true) (expansion-if-false)). The condition is implemented
using the &eq macro, which tests if its two arguments are equal. Putting all these together, the nested
loops look like this:

(&foreach-token tok1
  (&foreach-token tok2
    (&if (&eq tok1 tok2)
      () ;; If tok1==tok2, expand to an empty list (ignored by parser)
      (mutate (from (tok1)) (to (tok2)) (rate u)))))

While this illustrates XRate's looping and conditional capabilities, such a simple model would almost 
be easier to code by hand. For a slightly more complex application, we turn to the model of Pupko et al 
in their 2008 work. In their RASER program the authors used a chain augmented with a latent variable 
indicating "slow" or "fast" substitution. Reconstructing ancestral sequences on an HIV phylogeny allowed 
them to infer locations of transitions between slow and fast modes - indicating a possible gain or loss 
of selective pressure [25]. The chain shown below, Q_RASER, shows a simplified version of their model:
substitutions within rate classes occur according to a JC69 model scaled by rate parameters s and f (slow
and fast, respectively), and transitions between rate classes occur with rates r_sf and r_fs (slow -> fast and
fast -> slow, respectively).

$$Q_{RASER} = \begin{pmatrix} s\,Q_{JC} & r_{sf}\,I \\ r_{fs}\,I & f\,Q_{JC} \end{pmatrix}$$

In this block form (off-diagonal entries only), Q_JC denotes the JC69 rate matrix with off-diagonal rate u,
I is the 4x4 identity matrix, the first four rows and columns correspond to the slow states and the last four
to the fast states, and each diagonal entry is the negative sum of the remaining entries in its row.

While this chain contains four times as many rates as the basic JC69 model, there are only five
parameters: u, s, f, r_sf, r_fs, since the model contains repetition via its symmetry. While manual implementation
is possible, the model can be expressed in just a few lines of XRate macro code. Further, additional 
"modes" of substitution (corresponding to additional quadrants in the matrix above) can be added by 
editing the first two lines of the following code. 

XRate represents latent variable chains as tuples of the form (state class), where state is a 
particular state of the Markov chain and class is the value of a hidden variable. In this case, standard 
DNA characters are augmented with a latent variable indicating substitution rate class: Af indicates an 
A which evolves "fast." The following syntax is used to declare a latent variable chain (in this case, this 
variable may take values s or f ), with the row tag specifying CLASS as the Stockholm #=GR identifier for 
per-sequence, per-column annotations: 

(hidden-class (row CLASS) (label (s f))) 

Combining loops, conditionals, hidden classes, and the (&cat LIST) function (which concatenates 
the elements of LIST), we get the following XRate code for the RASER chain: 

(rate (s 0.1) (f 2.0) (r_sf 0.01) (r_fs 0.01) (u 1.0))
(chain
  (hidden-class (row CLASS) (label (s f)))
  (terminal RASER)
  (&foreach class1 (s f)
    (&foreach class2 (s f)
      (&foreach-token tok1
        (&foreach-token tok2
          (&if (&eq class1 class2)
            (&if (&eq tok1 tok2)
              () ;; if class1==class2 && tok1==tok2, expand to empty list (will be ignored)
              ;; The following line handles the case (class1==class2 && tok1!=tok2)
              (mutate (from (tok1 class1)) (to (tok2 class2)) (rate u class1)))
            (&if (&eq tok1 tok2)
              ;; The following line handles the case (class1!=class2 && tok1==tok2)
              (mutate (from (tok1 class1)) (to (tok2 class2)) (rate (&cat r_ class1 class2)))
              ()))))))) ;; if class1!=class2 && tok1!=tok2, expand to empty list (ignored)
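
For readers who prefer to see the resulting rates enumerated, the following Python sketch mirrors the four nested loops of the RASER chain above. It is illustrative only: it assumes that `(rate u class1)` denotes the product of u and the class-specific multiplier (consistent with "a JC69 model scaled by rate parameters s and f"), and the function and argument names are hypothetical.

```python
TOKENS = "ACGT"
CLASSES = ("s", "f")

def raser_rates(u, class_rate, switch_rate):
    """Off-diagonal rates of the simplified RASER chain.

    class_rate maps a class label to its rate multiplier (s or f);
    switch_rate maps (class1, class2) to the class-switching rate
    (r_sf or r_fs). States are (token, class) tuples.
    """
    rates = {}
    for c1 in CLASSES:
        for c2 in CLASSES:
            for t1 in TOKENS:
                for t2 in TOKENS:
                    if c1 == c2 and t1 != t2:
                        # Substitution within a rate class.
                        rates[(t1, c1), (t2, c2)] = u * class_rate[c1]
                    elif c1 != c2 and t1 == t2:
                        # Switch of rate class, token unchanged.
                        rates[(t1, c1), (t2, c2)] = switch_rate[c1, c2]
                    # All other pairs have rate zero and are omitted.
    return rates

rates = raser_rates(
    u=1.0,
    class_rate={"s": 0.1, "f": 2.0},
    switch_rate={("s", "f"): 0.01, ("f", "s"): 0.01},
)
print(len(rates))
print(rates[("A", "s"), ("C", "s")], rates[("A", "s"), ("A", "f")])
```

Under this reading, simultaneous changes of token and rate class have rate zero, which corresponds to the two empty-list branches in the macro code.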

Phylo-HMM-induced repetition The previous examples both involved specifying the Markov chain
component of a phylo-grammar. Coupled with a trivial top-level grammar (a START state and an EMIT 
state which emits the chain via the EMIT* pseudoterminal), these models describe an alignment where 
each column's characters evolve according to the same substitution model. A common extension to this 
is using sequences of hidden states which generate alignment columns according to different substitution 
models. These "phylo-grammars" (which can include phylo-SCFGs and the more restricted phylo-HMMs) 
allow modelers to describe and/or detect alignment regions exhibiting different evolutionary patterns. 
Phylo-HMMs model left-to-right correlations between alignment columns, and phylo-SCFGs are capable 
of modeling nested correlations (such as "paired" columns in an RNA secondary structure). Readers
unfamiliar with phylo-grammars may benefit from relevant descriptions and links available here:
http://biowiki.org/PhyloGrammars, animations available here: http://biowiki.org/PhyloFilm, and the
original paper describing XRate [9].

We outline here a phylo-HMM that is simple to describe, but would take a substantial amount of code 
to implement without XRate's macro language. The model is based on PhastCons, a program by Siepel 
et al which uses an HMM whose three states (or, in XRate terminology, nonterminals) use substitution 
models differing only by rate multipliers [26]. This model, depicted schematically in Figure 1, can be
used to detect alignment regions evolving at different rates. If the rates of each hidden state correspond
to quantiles of the Gamma distribution, then summing over hidden states of this model is equivalent to
the commonly-used Gamma model of rate heterogeneity. We provide this grammar in Text S1, which
is essentially identical to the PhastCons grammar with n states except for its invocation of a Scheme 
function returning the n Gamma-derived rates for a given shape parameter. We can define such a 
model in XRate easily due to the symmetric structure: all three nonterminals have similar underlying 
substitution models (varying only by a multiplier) and also similar probabilities of making transitions to 
other nonterminals via grammar transformation rules. 

The grammar will have nonterminals named "1", "2", ... up to numNonterms, each one associated with a
rate parameter (r_1, r_2, ...) and substitution chain (chain_1, chain_2, ...). To express this grammar
in XRate macro code, we'll need to declare each of these nonterminals, the production rules which govern
transitions between them, rate parameters, and the nonterminal-associated substitution chains. (For a
fully-functional grammar, an alphabet is also needed; alphabets are omitted in code snippets included in the
main text, but the corresponding grammars in Text S1 contain alphabets.)

First, define how many nonterminals the model will have: adding more nonterminals to the model later
on can be done simply by adjusting this variable. We define a SEED value to initialize the rate parameters
(this is not a random number seed, but rather an initial guess at the parameter value necessary for the
EM algorithm to begin), which is done inside a &foreach-integer loop using the numNonterms variable.
The (&foreach-integer X (1 K) expression) macro expands expression for all values of X from 1 to K. In
this case, we define a rate parameter for each of our nonterminals 1..K.

(&define numNonterms 3)
(&define SEED 0.001)

(&foreach-integer nonterminal (1 numNonterms)
  (rate ((&cat r_ nonterminal) SEED)))

Next, define a Markov chain for each nonterminal: all make use of the same underlying substitution 
model (e.g. JC69 [1], HKY85 [3]) whose entries are stored as Q_a_b for the transition rate between
characters a and b. This "underlying" chain must be defined elsewhere - either in an included file (using
the (&include) directive), or directly in the grammar file. For instance, we could re-use the JC69 chain,
declaring rate parameters for later use:

(&foreach-token tok1
  (&foreach-token tok2
    (&if (&eq tok1 tok2)
      () ;; If tok1==tok2, expand to an empty list (ignored by parser)
      (rate (&cat Q_ tok1 _ tok2) u))))

Each nonterminal has an associated substitution model which is Q_a_b scaled by a different rate 
multiplier r_nonterminal. Using an integer loop, we create a chain for each nonterminal using the rate 
parameters we defined in the two previous code snippets: 

(&foreach-integer nonterminal (1 numNonterms)
  (chain
    (terminal (&cat chain_ nonterminal))
    (&foreach-token tok1
      (&foreach-token tok2
        (&if (&eq tok1 tok2)
          () ;; If tok1==tok2, expand to an empty list (ignored by parser)
          (mutate (from (tok1)) (to (tok2))
            (rate (&cat Q_ tok1 _ tok2) (&cat r_ nonterminal))))))))

Next, define the production rules which govern the nonterminal transitions. For simplicity of presentation
(though this is not required), we assume here that transitions between nonterminals all occur with probability
proportional to leaveProb, and all self-transitions have probability stayProb.

The pgroup declaration defines a probability distribution over a finite outcome space, with the
parameters declared therein normalized to unity during parameter estimation. In this grammar we declare
stayProb and leaveProb within a pgroup since they describe the two outcomes at each step of creating 
the alignment: staying at the current nonterminal or moving to a different one. 

(pgroup (stayProb 0.9) (leaveProb 0.1))
(&foreach-integer nonterm1 (1 numNonterms)
  ;; Each nonterminal has a transition from start
  (transform (from (start)) (to (nonterm1)) (prob (&/ 1 numNonterms)))
  ;; Each nonterminal can transition to end - we assign this prob 1
  ;; since the alignment length directs when this transition occurs
  (transform (from (nonterm1)) (to ()) (prob 1))
  (&foreach-integer nonterm2 (1 numNonterms)
    (&if (&eq nonterm1 nonterm2)
      ;; If nonterm1==nonterm2, this is a self-transition
      (transform (from (nonterm1)) (to (nonterm2)) (prob stayProb))
      ;; Otherwise, this is an inter-nonterminal transition
      ;; with probability leaveProb / (numNonterms - 1)
      (transform (from (nonterm1)) (to (nonterm2))
        (prob (&/ leaveProb (&- numNonterms 1)))))))

Lastly, associate each nonterminal with its specially-designed Markov chain for emitted alignment 
columns: 

(&foreach-integer nonterminal (1 numNonterms)
  (transform (from (nonterminal)) (to ((&cat chain_ nonterminal) (&cat nonterminal *))))
  (transform (from ((&cat nonterminal *))) (to (nonterminal))))
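
As a sanity check on the transition structure defined by the production rules above, the Python sketch below (illustrative only; function names are hypothetical) builds the nonterminal-to-nonterminal transition matrix from stayProb and leaveProb and confirms that each row is a proper probability distribution when the two parameters sum to one, as the pgroup declaration guarantees.

```python
def transition_matrix(num_nonterms, stay_prob, leave_prob):
    """Transition probabilities between the hidden nonterminals.

    Self-transitions have probability stay_prob; the probability leave_prob
    is shared equally among the other (num_nonterms - 1) nonterminals.
    """
    if num_nonterms < 2:
        raise ValueError("need at least two nonterminals to leave a state")
    matrix = []
    for i in range(num_nonterms):
        row = []
        for j in range(num_nonterms):
            if i == j:
                row.append(stay_prob)
            else:
                row.append(leave_prob / (num_nonterms - 1))
        matrix.append(row)
    return matrix

def start_distribution(num_nonterms):
    """Each nonterminal is entered from the start state with equal probability."""
    return [1.0 / num_nonterms] * num_nonterms

matrix = transition_matrix(3, stay_prob=0.9, leave_prob=0.1)
for row in matrix:
    print(row, sum(row))
print(start_distribution(3))
```

Changing the number of nonterminals only changes the first argument, mirroring how the grammar is resized by editing numNonterms.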

Data-induced repetition Models whose symmetric structure depends on the input data are less 
common in phylogenetic analysis, perhaps because normally their implementation requires creating a 
new model for each new dataset to be analyzed. XRate allows the user to create models based on 
different parts of the input data, namely the tree and the alignment, "on the fly" via its macro language. 
This is accomplished by making use of the tree iterators (e.g. &BRANCHES, &NODES, and &LEAVES) and
alignment data (e.g. &COLUMNS) to create nonterminals and/or terminal chains associated with these
parts of the input data. 

In their program DLESS, Haussler and colleagues used such an approach in a tree-dependent model 
to detect lineage-specific selection. Their model used a phylo-HMM with different nonterminals for 
each tree node, with the substitution rate below this node scaled to reflect gain or loss of functional 
elements [26]. We show a simplified form of their model as a schematic in Figure 2, with blue colored
branches representing a slowed evolutionary rate.

Using XRate's macros we can express this model in a compact way just as was done with the PhastCons 
model. Since both models use a set of nonterminals with their own scaled substitution models, we need 
simply to replace the integer-based loop (&foreach-integer nonterminal (1 numNonterms) expression)
with the tree-based loop (&foreach-node state expression) to create a nonterminal for each
node in the tree. Then, define each node-specific chain as a hybrid chain, such that the chain associated
with tree node n has all the branches below node n scaled to reflect heightened selective pressure. Hybrid
chains, substitution processes which vary across the tree, are discussed briefly in the section on "Recent
enhancements to XRate", and the details of their specification are thoroughly covered in the XRate format
documentation, available here: http://biowiki.org/XrateFormat. A minimal working form of the
DLESS-style grammar is included in Text S1.

A repetitively-structured codon model specified using Scheme functions 

While XRate's macro language is very flexible, there are some relatively common models that are difficult 
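to express within the language's constraints. For example, a Nielsen-Yang codon matrix incorporating
transition bias and selection has nearly 4,000 entries whose rates are determined by the following criteria:

$$Q^{NY}_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ at more than one position} \\
\pi_j & \text{if } i \text{ and } j \text{ differ by a synonymous transversion} \\
\kappa \pi_j & \text{if } i \text{ and } j \text{ differ by a synonymous transition} \\
\omega \pi_j & \text{if } i \text{ and } j \text{ differ by a nonsynonymous transversion} \\
\omega \kappa \pi_j & \text{if } i \text{ and } j \text{ differ by a nonsynonymous transition}
\end{cases}$$

Here \pi_j is the equilibrium frequency of codon j, \kappa is the transition/transversion rate ratio, and
\omega is the nonsynonymous/synonymous rate ratio.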



This sort of Markov chain is difficult to express in XRate's macro language since its entries are 
determined by aspects of the codons (synonymous changes and transitions/transversions) which in turn 
depend on knowledge of the properties of nucleotides and codons that would have to be hard-coded 
directly into the loops and conditionals afforded by XRate's macros. The conditions on the right side 
of the above equation are better framed as values returned from a function: given a pair of codons, the 
function returns the "type" of difference between them, which in turn determines the rate of substitution 
between the two codons. 
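
The function-based framing described above can be sketched as follows. This Python example is illustrative only and is not the implementation in XRate's Scheme standard library: it assumes the standard genetic code, treats any pair involving a stop codon as having rate zero, and uses hypothetical function names (`difference_type`, `ny_rate`).

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}

def difference_type(codon_i, codon_j):
    """Classify the difference between two codons."""
    diffs = [(x, y) for x, y in zip(codon_i, codon_j) if x != y]
    if not diffs:
        return "identical"
    if len(diffs) > 1:
        return "multiple"
    x, y = diffs[0]
    # A transition stays within purines or within pyrimidines.
    change = "transition" if (x in PURINES) == (y in PURINES) else "transversion"
    same_aa = CODON_TABLE[codon_i] == CODON_TABLE[codon_j]
    return ("synonymous " if same_aa else "nonsynonymous ") + change

def ny_rate(codon_i, codon_j, kappa, omega, pi_j):
    """Rate from codon_i to codon_j under the criteria listed above."""
    if "*" in (CODON_TABLE[codon_i], CODON_TABLE[codon_j]):
        return 0.0  # assumption of this sketch: stop codons are excluded
    kind = difference_type(codon_i, codon_j)
    if kind in ("identical", "multiple"):
        return 0.0
    rate = pi_j
    if kind.endswith("transition"):
        rate *= kappa
    if kind.startswith("nonsynonymous"):
        rate *= omega
    return rate

sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
print(len(sense_codons), len(sense_codons) ** 2)
print(difference_type("TTT", "TTC"), "|", difference_type("TTT", "TTA"))
```

The count of sense-codon pairs printed here is the "nearly 4,000 entries" referred to in the text; most of them are zero because the codons differ at more than one position.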

Scheme extensions It is this sort of situation which motivates extensions to XRate that are more 
general-purpose than the simple macros described up to this point. There are several valid choices for the 
programming language that can be used to implement such extensions. For example, a chain such as Q_NY
can be generated fairly easily by way of a Perl or Python script tailored to generate XRate grammar code. 
While this is a convenient scripting mechanism for many users (and is perfectly possible with XRate), 
it tends to lead to an awkward mix of code and embedded data (i.e. snippets of grammar-formatting 
text). This obscures both the generating script and the final generated grammar file (the former due to 
the code/data mix, and the latter due to sheer size). 

Another choice of programming language for implementing XRate extensions, which suffers slightly 
less from these limitations, is Scheme. As XRate's macro language is based on Lisp (the parent language 
to Scheme), the syntaxes are very similar, so the "extension" blends naturally with the surrounding XRate 
grammar file. Scheme is inherently functional and is also "safe" (in that it has garbage collection). Lastly, 
data and code have equivalent formats in Scheme, enabling the sort of code/data mingling outlined above. 

To implement the $Q^{NY}$ chain in XRate, we can use the XRate Scheme standard library (found in 
dart/scheme/xrate-stdlib.scm). This standard library implements all the necessary functions to define 
the Nielsen-Yang model, with the genetic code implemented as a Scheme association list (facilitating easy 
substitution of alternate genetic codes, such as the mitochondrial code) as well as a wrapper function to 
initialize the entire model. 

Without stepping through every detail of the Scheme implementation of the Nielsen-Yang model in 
the XRate standard library, we will simply note that this implementation (the Nielsen-Yang model on a 
DNA alphabet) is available via the following XRate code (the include path to dart/scheme is searched 
by default by the Scheme function load-from-path): 

(&scheme 
 (load-from-path "xrate-stdlib.scm") 
 xrate-dna-alphabet 
 (xrate-NY-grammar)) 

Note that xrate-dna-alphabet is a simple variable, but xrate-NY-grammar is a function and is 
therefore wrapped in parentheses (as per the syntax of calling a function in Scheme). The reason that 
xrate-NY-grammar is a function is so that the user can optionally redefine the genetic code, which (as 
noted above) is stored as a Scheme association list, in the variable codon-translation-table (the 
standard library code can be examined for details). 

A macro-heavy grammar for RNA structures in protein-coding exons 

As a final example of the possibilities that XRate's new model-specification features enable, we present 
a new grammar for predicting RNA structures which overlap protein-coding regions. XDecoder is based 
closely on the RNADecoder grammar first developed by Pedersen and colleagues [27]. This grammar is 
designed to detect phylogenetic evidence of conserved RNA structures, while also incorporating the evolu- 
tionary signals brought on by selection at the amino-acid level. In eukaryotes, RNA structure overlapping 
protein coding sequence is not yet well-known, but in viral genomes this is a common phenomenon due to 
constraints on genome size acting on many virus families. XDecoder is available as an XRate grammar, 
linked here: http://biowiki.org/XratePaper2011 



Motivation for implementation Our endeavor to re-implement the RNADecoder grammar was based 
on both practical and methodological reasons. The original RNADecoder code performs well on 
published viral datasets [28], but is no longer maintained. Running RNADecoder on an alignment of full viral 
genomes is quite involved: the alignment must first be split up into appropriately-sized chunks (~300 
columns), converted to COL format [29], and linked to a tree in a special XML file which directs the 
analysis. The grammar and its parameters, also stored in an XML format, are difficult to read and 
interpret. RNADecoder attains remarkably higher specificity in genome-wide scans as compared to 
protein-naive prediction programs like PFOLD [15] or MFOLD [30]. 



Using XDecoder We developed our own variant of the RNADecoder model as an XRate grammar, 
called XDecoder. This would have been a protracted task without XRate's macro capabilities: the 
expanded grammar is nearly 4,000 lines of code. Using XRate's macros, the main grammar (excluding 
the pre-estimated dinucleotide Markov chain) is only ~100 lines of macro code. Starting with an alignment 
of full-length poliovirus genomes, annotated with reading frames, an analysis can be run with a single 
command: 

xrate -g XDecoder.eg -l 300 -wig polio.wig polio.stk > polio_annotated.stk 

This runs XRate with the XDecoder grammar on the Stockholm-format alignment polio.stk, allowing 
no more than 300 positions between paired columns, creating the wiggle file polio.wig, annotating 
the original alignment with maximum likelihood secondary structure and rate class indicators, and writing 
the annotated alignment to the file polio_annotated.stk. 

Each analysis with RNADecoder requires an XML file to coordinate the alignment and tree as well as 
direct parts of the analysis (training and annotation). XRate reads Stockholm-format alignments, which 
natively allow for alignment-tree association, enabling simple batch processing of many alignments. The 
grammar can be run on arbitrarily long alignments, provided a suitable maximum pair length is specified 
via the -l N argument. This prevents XRate from considering any pairing whose columns are more than 
N positions apart, effectively limiting both the memory usage and runtime. 
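
To give a feel for why a maximum pair length helps, the Python sketch below counts the column pairs that remain eligible once pairs more than N positions apart are excluded. This is only a back-of-the-envelope count of candidate pairs, not a model of XRate's actual dynamic programming tables. The function name, the inclusive treatment of the limit, and the alignment length of 7,500 columns are assumptions chosen for illustration.

```python
def count_candidate_pairs(num_columns, max_pair_length=None):
    """Count column pairs (i, j) with i < j and j - i <= max_pair_length.

    With no limit (None), every pair of distinct columns is a candidate.
    """
    if max_pair_length is None or max_pair_length >= num_columns - 1:
        return num_columns * (num_columns - 1) // 2
    n = max_pair_length
    # For each separation d = 1..n there are (num_columns - d) pairs.
    return n * num_columns - n * (n + 1) // 2


# Hypothetical genome-length alignment of 7,500 columns.
unrestricted = count_candidate_pairs(7500)
restricted = count_candidate_pairs(7500, max_pair_length=300)
reduction_factor = unrestricted / restricted
```

Without a limit the number of candidate pairs grows quadratically with alignment length, whereas with a fixed N it grows only linearly, which is what makes arbitrarily long alignments tractable.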

Training the grammar's parameters, which may be necessary for running the grammar on significantly 
different datasets, is also accomplished with a single command: 

xrate -g XDecoder.eg -l 300 -t XDecoder.trained.eg polio.stk 

The results of an analysis using XDecoder are shown in Figure 3 together with gene and RNA structure 
annotations. Also shown are three related analyses (all done using XRate grammars): PhastCons 
conservation, coding potential, and pairing probabilities computed using PFOLD. These three separate 
analyses reflect the signals that XDecoder must tease apart in order to reliably predict RNA structures. 
DNA-level conservation could be due to protein-coding constraints, regional rate variation, pressure to 
maintain a particular RNA structure, or a combination of all three. Using codon-position rate multipliers, 
multiple rate classes, and a secondary structure model, XDecoder unifies all of these signals in a single 
phylogenetic model, resulting in the highly-specific predictions shown at the top of Figure 3. 

Recent enhancements to XRate 
Lineage-specific models 

All Markov chains in phylo-grammars describe the evolution of characters starting at the root and ending 
at the tips of the tree. In lineage-specific models, or hybrid chains in XRate terminology, the requirement 
that all branches share the same substitution process is relaxed. Phylogenetic analysis is often used 
to detect a departure from a "null model" representing some typical evolutionary pattern. Standard 
applications of HMMs and SCFGs focus on modeling this departure on the alignment level, enabling 
different columns of the alignment to show different patterns of evolution. Using hybrid chains, users 
can explicitly model differences in evolution across parts of the tree. By combining a hybrid chain with 
grammar nonterminals, this could be used to detect alignment regions (i.e. subsets of the set of all sites) 
which display unusually high (or low) mutation rates in a particular part of the tree, such as in the DLESS 
model described in the section on "Data-induced repetition". The details of specifying such models are 
contained within the XRate format documentation, at http://biowiki.org/XrateFormat 



Ancestral sequence reconstruction 

A phylo-grammar is a generative model: it generates a hidden parse tree, then further generates observed 
data conditional on that parse tree. The observed data here is an alignment of sequences; the hidden 
parse tree describes which alignment columns are to be generated by the evolutionary models associated 
with which grammar nonterminals. Inference involves reversing the generative process: reconstructing 
the hidden parse structure and evolutionary trajectories that explain the alignment. 

The original version of XRate focused on reconstructing the parse tree, for the purposes of annotating 
hidden structures such as gene boundaries or conserved regions. A newly implemented feature in 
XRate adds a second kind of inference: reconstruction of ancestral sequences. This functionality is already 
implicit in the phylogenetic model: no additional modification to the grammar is necessary to enable 
reconstruction. The user can ask XRate to return the most probable ancestral sequence at each internal 
node, or the entire posterior distribution over such sequences, via the -ar and -arpp command-line 
options. Since XRate does marginal state reconstruction, the character with the highest posterior 
probability returned by the -arpp option will always correspond to the single character returned by the -ar 
option. Ancestral sequence reconstruction can be used to answer paleogenetic questions: what did the 
sequence of the ancestor to all of clade X look like? Similarly, evolutionary events such as particular 
substitutions or the gain or loss of function (also called trait evolution) can be pinpointed to particular 
branches. 
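
The relationship between the full posterior and the single most probable character can be sketched on a toy case. The Python example below computes the marginal posterior over the root character for a single alignment column on a two-leaf tree, and then takes its argmax. It is a minimal illustration and not XRate's implementation: the JC69 substitution model, the two-leaf tree, the branch lengths and all function names are assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability of observing y after time t, starting from x."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def root_posterior(leaf_a, leaf_b, t_a, t_b):
    """Marginal posterior over the root character given the two leaf characters."""
    weights = {
        x: 0.25 * jc69_prob(x, leaf_a, t_a) * jc69_prob(x, leaf_b, t_b)
        for x in ALPHABET
    }
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def most_probable(posterior):
    """Single most probable character: the argmax of the marginal posterior."""
    return max(posterior, key=posterior.get)


# A column with 'A' on a short branch and 'G' on a long branch.
posterior = root_posterior("A", "G", t_a=0.05, t_b=0.5)
best = most_probable(posterior)
```

Because the single reconstructed character is defined here as the argmax of the marginal posterior, the two outputs cannot disagree, which is the property noted above for the -ar and -arpp options.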



Direct output of GFF and Wiggle annotations 

XRate allows parse annotations to be written out directly in common bioinformatics file formats: GFF (a 
format for specifying co-ordinates of genomic features) [22] and WIG (a per-base format for quantitative 
data) [31]. 

This allows a direct link between XRate and visualization tools such as JBrowse [32], GBrowse [33], 
the UCSC Genome Browser [34], and Galaxy [35], allowing the results of different analyses to be displayed 
next to one another and/or processed in a unified framework. 

GFF: Discrete genomic features GFF is a format oriented towards storing genomic features using 9 
tab-delimited fields: each line represents a separate feature, with each field storing a particular aspect of 
the feature (e.g. identifier, start, end, etc). With XRate, a common application is using GFF to annotate 
an alignment with features corresponding to grammar nonterminals. For instance, using a gene-prediction 
grammar one could store the predicted start and end points of genes together with a confidence measure. 
Similarly, predicted RNA base pairs could be represented in GFF as one feature per pair, with start and 
end positions indicating the paired positions. 
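
As a small illustration of the nine-field layout, the Python sketch below formats one feature as a tab-delimited GFF line. It is not XRate's output code. The function name, the feature type "base_pair", the source label "XRate", the sequence name "polio" and the key=value attribute style are assumptions made for this example, and the exact content XRate writes to each field may differ.

```python
def gff_line(seqid, source, feature_type, start, end,
             score=".", strand=".", phase=".", attributes=None):
    """Format one feature as a 9-field, tab-delimited GFF line.

    Coordinates are 1-based and inclusive; "." marks an undefined field.
    """
    attr_text = ";".join(f"{k}={v}" for k, v in (attributes or {}).items()) or "."
    fields = [seqid, source, feature_type, start, end,
              score, strand, phase, attr_text]
    return "\t".join(str(field) for field in fields)


# A predicted base pair stored as one feature whose start and end are the
# two paired alignment columns, with a hypothetical confidence as the score.
pair_line = gff_line("polio", "XRate", "base_pair", 120, 145,
                     score=0.93, attributes={"ID": "pair1"})
```

Storing the confidence in the score field keeps it available to downstream tools that filter or color features by score.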

WIG: Quantitative values for each column Wiggle format stores a quantitative value for a 
single position or a group of positions. This can be especially useful to summarize a large number of possibilities as 
a single representative value. For instance, when predicting regions of structured RNA, XRate may sum 
over many thousands of possible structures. We can summarize the model's results with the posterior 
probability that each column is involved in a base-pairing interaction. 
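
The per-column summary described here can be sketched in a few lines of Python. Given posterior probabilities for individual base pairs, the example sums them into a per-column pairing probability and writes the result as a fixedStep wiggle track. This is only an illustration: the pair probabilities below are made-up numbers, the function names and the track name "polio" are hypothetical, and XRate computes the underlying posteriors from the grammar rather than from a dictionary like this one.

```python
def column_pairing_probabilities(num_columns, pair_probs):
    """Sum pair posteriors into a per-column probability of being paired.

    pair_probs maps 1-based column pairs (i, j) to posterior probabilities.
    """
    paired = [0.0] * num_columns
    for (i, j), prob in pair_probs.items():
        paired[i - 1] += prob
        paired[j - 1] += prob
    return paired


def to_fixed_step_wiggle(values, chrom, start=1):
    """Render one value per position as a fixedStep wiggle track."""
    lines = [f"fixedStep chrom={chrom} start={start} step=1"]
    lines.extend(f"{value:.4f}" for value in values)
    return "\n".join(lines) + "\n"


# Hypothetical pair posteriors for a five-column alignment.
example_pairs = {(1, 5): 0.90, (2, 4): 0.60, (1, 4): 0.05}
column_probs = column_pairing_probabilities(5, example_pairs)
wiggle_text = to_fixed_step_wiggle(column_probs, chrom="polio")
```

Since each column can pair with at most one partner in any single structure, the pair posteriors touching a given column sum to at most one, so the per-column value can be read directly as a probability.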

The Dart Scheme (Darts) interpreter 

Another way to use XRate, instead of running it from the command line, is to call it from the Scheme 
interpreter (included in DART). The compiled interpreter executable is named "darts" (for "DART 
Scheme"). This offers a simple yet powerful way to create parameter-fitting and genome annotation 
workflows. For example, a user could train a grammar on a set of alignments, then use the resulting 
grammar to annotate a set of test alignments. 

Darts, in common with the Scheme interpreter used in XRate grammars, is implemented using Guile 
(GNU's Ubiquitous Intelligent Language for Extension: http://www.gnu.org/software/guile/guile.html). 
Certain commonly-encountered bioinformatics objects, serializable via standard file formats and imple- 
mented as C++ classes within XRate, are exposed using Guile's "small object" (smob) mechanism. 
Currently, these types include Newick-format trees and Stockholm-format alignments. API calls are 
provided to construct these "smobs" by parsing strings (or files) in the appropriate format. The smobs 
may then be passed directly as parameters to XRate API calls, or may be "unpacked" into Scheme data 
structures for individual element access. Guile encourages sparing use of smobs; consequently, smobs 
are used within Darts exclusively to implement bioinformatic objects that already have a broadly-used 
file format (Stockholm alignments and Newick trees). In contrast, formats that are newly-introduced by 
XRate (grammars, alphabets and so forth) are all based on S-expressions, and so may be represented 
directly as native Scheme data structures. 

The functions listed in Section B provide an interface between Scheme and XRate. Together with the 
functions in the XRate-scheme standard library and Scheme's native functional scripting abilities, a broad 
array of models and/or workflows is possible. For instance, one could estimate several sets of parameters 
for Nielsen-Yang models using groups of alignments, and then embed each one in a PhastCons-style 
phylo-HMM, finally using this model to annotate a set of alignments. While this and other workflows could 
be accomplished in an external framework (e.g. Make, Galaxy [35]), Darts provides an alternate way to 
script XRate tasks using the same language that is used to construct the grammars. 

Materials and Methods 

Text S1 contains example grammars referred to in the text, as well as small and large test Stockholm 
alignments. The alignment of poliovirus genomes along with the grammars used to produce Figure 3 are 
also included along with a Makefile indicating how the data were analyzed. Typing make help in the 
directory containing the Makefile will display the demonstrations available to users. 

Figure 1. The model used by PhastCons, a 3-nonterminal HMM with rate multipliers, is 
compactly expressed by XRate's macro language. Different nonterminals have different 
evolutionary rates, but they all share the same underlying substitution model. Transition probabilities 
are shared: a transition between nonterminals happens with probability leaveProb, and self-transitions 
happen with probability stayProb. This model (with any number of nonterminals) can be expressed in 
XRate's macro language in approximately 20 lines of code. 







Figure 2. A schematic of a DLESS-style phylo-HMM: each node of the tree has its own 
nonterminal, such that the node-rooted subtree evolves at a slower rate than the rest of 
the tree. Inferring the pattern of hidden nonterminals generating an alignment allows for detecting 
regions of lineage-specific selection. Expressing this model compactly in XRate's macro language allows 
it to be used with any input tree without having to write data-specific code or use external 
model-generating scripts. 

Figure 3. Data from several XRate analyses, shown alongside genes (A) and known RNA 
structures (B) in poliovirus. XDecoder (C) recovers all known structures with high posterior 
probability and predicts a promising target for experimental probing (region 6800-7100). XDecoder was 
run on an alignment of 27 poliovirus sequences with the results visualized as a track in JBrowse [32] via 
a wiggle file. Alongside XDecoder probabilities are the three signals which XDecoder aims to 
disentangle: (D) conservation, (E) coding potential, and (F) RNA structure. Paradoxically the CRE 
and RNase-L inhibition elements show both conservation and coding sequence preservation, whereas 
PFOLD's predictions show only a slight increase in probability density around the known structures. 
XDecoder is the only grammar which returns predictions of reasonable specificity. The full JBrowse 
instance is included as Text S2. 



Tables 

A Glossary of XRate model terminology 

Within the glossary descriptions, italicized phrases refer to other glossary terms. 
Alignment: See multiple sequence alignment. 

Alphabet: The set of single-character tokens (symbols) from which sequences are constituted. The 
alphabet is defined in the grammar file; only one alphabet may be defined per grammar file. (Usually 
the alphabet is DNA, RNA or protein. Sometimes the alphabet is extended to include an explicit 
gap character.) An alphabet may optionally include a complement mapping, as well as specification 
of degenerate (ambiguous) characters. 

Ancestral reconstruction: The use of XRate to reconstruct the sequences at ancestral nodes of a 
phylogenetic tree, given a grammar, a multiple sequence alignment and a parse tree. This occurs 
after tree estimation, training and annotation. 

Annotation: The use of XRate to apply a grammar to a multiple sequence alignment and phylogenetic 
tree, so as to impute the optimal parse tree and mark up the alignment with the co-ordinates 
of selected features (associated with particular nonterminals in the parse tree), or generate other 
annotation including GFF and WIGgle files. This occurs after tree estimation and training, but 
prior to ancestral reconstruction. Can also refer to a specific part of a transformation rule that 
generates annotations. 

Bifurcation: A transformation rule that generates two nonterminals. Bifurcation rules have the form 

(transform (from (A)) (to (B C))) 
where A, B and C are nonterminals. 

Chain: A substitution rate matrix (the name comes from "continuous-time Markov chain"). The states 
of the substitution process are N-mers augmented with an optional hidden variable. That is, the 
state space of a chain consists of state-tuples of the form $(s_1, s_2, \ldots, s_N, h)$ with $N \geq 1$, where 
$s_1$ through $s_N$ represent alphabet symbols (which will be observed in the final multiple sequence 
alignment) and $h$ is an optional hidden state which can take on a finite set of single-character values 
specific to this chain. Each of the N alphabet symbols, $s_1$ through $s_N$, is associated with a unique 
pseudoterminal. Examples of valid chain state spaces include the set of all nucleotides; the set of 
all codons; and the set of all tuples $(A, H)$ where $A$ is an amino acid and $H \in \{F, S\}$ is a hidden 
binary variable taking values F (for fast) or S (for slow). 

Complement: An order-2 permutation on the tokens of an alphabet. (Typically only used for DNA or 
RNA alphabets.) 

Emission: A transformation rule that generates some pseudoterminals (and thus, some alignment columns); 
or the set of pseudoterminals (or alignment columns) generated by such a rule. In XRate, emission 
rules have the form $A \to x_1 \ldots x_L \, A^* \, x_{L+1} \ldots x_{L+R}$ where A, A* are paired nonterminals (whose 
names differ only by the final asterisk) and $x_1 \ldots x_{L+R}$ are pseudoterminals. (A* is referred to as 
the post-emit nonterminal.) Any numbers L, R of pseudoterminals can appear to the left and right 
of the A*, as long as L + R > 0. If L = 0 and R > 0, the rule is a right-emission; if L > 0 and 
R = 0, the rule is a left-emission. The pseudoterminals $x_1 \ldots x_{L+R}$ must comprise (any permutation 
of) the full set of pseudoterminals for a given substitution chain. Each pseudoterminal may 
optionally be prefixed with a tilde character (~) to indicate that it should be complemented in the 
final alignment (used to generate reverse strands in double-stranded models). For example, if C1, 
C2 and C3 are the three pseudoterminals of a codon chain, A is an emission nonterminal and A* is 
the corresponding post-emission nonterminal, valid emission rules could include 

(transform (from (A)) (to (C1 C2 C3 A*)) (prob (...))) 

and 

(transform (from (A)) (to (~C3 A* ~C2 ~C1)) (prob (...))) 

Grammar: The contents of a grammar file: chains, nonterminals, transformation rules, and alphabet. 
(The alphabet is specified in a separate part of the file from the rest of the grammar, and so is 
sometimes omitted from this definition.) 

Grammar symbol: A symbol that is either a nonterminal or a pseudoterminal. 

HMM: Hidden Markov Model. An SCFG that is also a regular grammar. See also phylo-HMM. 

Hidden state: In the context of XRate, this term is ambiguous (see state). In this article, it is used 
mostly to refer to the final element of a state-tuple in a chain. However, in the context of HMM 
theory, it refers to what we call a nonterminal. 

Hybrid chain: A mapping from tree branches to substitution rate matrices (chains) where the 
instantaneous rate matrix may vary from one branch to another. This may be used to implement 
lineage-dependent selection, or other models which are heterogeneous with respect to the tree. 

Initial distribution: The initial probability distribution over states in a substitution chain. 

Left-emission: See emission. 

Left-regular: A grammar is left-regular if it contains no bifurcations and its emissions are all left- 
emissions. 

Macro: A construct that is expanded by the XRate grammar preprocessor and may be used to im- 
plement redundant or repetitive grammar models; e.g. grammars with a large number of similar 
transformation rules sharing the same probability parameter, or substitution chains whose mutation 
rules all share the same rate parameter. 

Multiple sequence alignment: The raw data on which XRate operates, and which constitutes its 
input and output. XRate cannot align sequences, but assumes that they have been pre-aligned using 
an external alignment program. Alignments must be converted to Stockholm format [36] before 
supplying them to XRate. The alignment may include a phylogenetic tree (using the Stockholm 
syntax for specifying this); if no tree is provided, XRate's tree estimation routines can be used to 
find one. 

Mutation rule: A single element in the rate matrix of a substitution chain. 

Nonterminal: A grammar symbol that may be transformed, by application of transformation rules, into 
other nonterminals or pseudoterminals. In XRate, a nonterminal must be exclusively associated 
with (that is, appear on the left-hand side of) either emission rules, transition rules or bifurcation 
rules. 

Parameter: A named parameter in a grammar. May be a probability parameter or a rate parameter. 

Parametric model: A grammar whose transformation rules or mutation rules (or both) are specified 
as functions of the grammar's parameters, rather than as direct numerical values. 

Parse tree: A tree structure corresponding to the derivation of a multiple sequence alignment from a 
grammar. Each tree node is labeled with a grammar symbol: the root node is labeled with the 
start nonterminal, internal nodes are labeled with nonterminals, and the leaves are labeled with 
pseudoterminals. Not to be confused with a phylogenetic tree. 

PGroup: A set of probability parameters collectively representing a probability distribution over a finite 
set of events. Following training, probability parameters constituting a PGroup will be normalized 
to sum to 1. 

Phylogenetic tree: The evolutionary tree describing the relationship between sequences in a multiple 
alignment. XRate uses the Stockholm format for alignments, which allows the tree to be included 
as an annotation of the alignment. If no tree is provided, XRate's tree estimation routines can be 
used to find one. 

Phylo-grammar: See phylo-SCFG. 

Phylo-HMM: A phylo-SCFG that uses a regular grammar. A phylo-HMM is an HMM whose emissions 
generate alignment columns by evolving substitution chains on a phylogenetic tree. 

Phylo-SCFG: A phylogenetic SCFG: a member of the general class of grammars implemented by XRate. 
A phylo-SCFG is an SCFG whose emissions generate alignment columns by evolving substitution 
chains on a phylogenetic tree. 

Post-emit nonterminal: See emission. 

Production rule: See transformation rule. 

Probability parameter: A dimensionless parameter that generally takes a value between 0 and 1, and 
so can occur in the probability part of a transformation rule (or as a multiplying factor in the rate 
part of a mutation rule). Probability parameters are declared in PGroups. 

Pseudocounts: A set of nonnegative counts that specifies a Dirichlet prior distribution over a PGroup. 

Pseudoterminal: A grammar symbol that is generated via an emission and cannot be further modified 
by subsequent transformation rules. In a parse tree, a pseudoterminal serves as a placeholder for 
an alignment column. Pseudoterminals occur in groups associated with a particular substitution 
chain. In the generative interpretation of the model, alignment columns are generated using the 
initial distribution and mutation rules of the chain, applied on the phylogenetic tree associated with 
the alignment. 

Rate parameter: A nonnegative parameter that has units of "inverse time" (i.e. rate), and so can 
occur in the rate part of a mutation rule. Rate parameters can be declared individually. 

Regular grammar: A grammar is regular if it is either left-regular or right-regular, that is, it contains 
no bifurcations and its emissions are all either left-emissions or right-emissions. A regular grammar 
is equivalent to an HMM. 

Right-emission: See emission. 

Right-regular: A grammar is right-regular if it contains no bifurcations and its emissions are all right- 
emissions. 

SCFG: Stochastic Context-Free Grammar. See also phylo-SCFG. 

Start nonterminal: The first nonterminal declared or used in a grammar. In the generative interpre- 
tation of the model, this is the initial grammar symbol to which transformation rules are applied. 
It is also the label of the root node in the parse tree. 

State: In the context of a phylo-grammar, this term is ambiguous: it can refer either to a state-tuple in 
a chain, or (for phylo-HMMs) a nonterminal in a grammar. For the most part in this paper, and 
exclusively in this glossary, we use it in the former sense. 

State space: The set of possible state-tuples in a chain. 

State-tuple: A tuple of the form $(s_1, s_2, \ldots, s_N, h)$ representing a single state in a chain, where $s_1$ 
through $s_N$ represent alphabet symbols and $h$ is an optional hidden state. 

Substitution chain: A continuous-time finite-state Markov chain over state-tuples. See chain. 

Substitution model: See substitution chain. 

Terminal: See token. 

Token: An alphabet symbol. (Also called a terminal.) 

Training: The use of XRate to estimate a grammar's parameters, mutation rule rates and transformation 
rule probabilities, given a (set of) multiple alignments. This occurs after tree estimation and prior 
to annotation or ancestral reconstruction. 

Transformation rule: A probabilistic rule that describes the transformation of a nonterminal symbol 
into a sequence of zero or more grammar symbols. (Also called a production rule.) A transformation 
rule may be an emission, a transition or a bifurcation. 

Transition: A transformation rule that generates exactly one nonterminal (and no pseudoterminals). 
Transition rules have the form 

(transform (from (A)) (to (B)) (prob (...))) 
where A and B are nonterminals. 

Tree: In the context of a phylo-grammar, this term is ambiguous: it can mean a parse tree (which explains 
the "horizontal", i.e. spatial, structure of an alignment) or a phylogenetic tree (which explains the 
"vertical", i.e. temporal, structure). 

Tree estimation: The use of XRate to estimate a phylogenetic tree for a multiple sequence alignment, 
given a grammar. This occurs prior to training, annotation or ancestral reconstruction. 

B Tables of Scheme functions in Darts 

The following list of Scheme functions, natively implemented within selected DART programs (including 
XRate) when compiled with GNU Guile, is only complete to the date of publication. A more up-to-date 
list may be found at http://biowiki.org/DartSchemeFunctions 

B.1 Functions for working with trees 

Scheme function              Effect 

(newick-from-string x)       Create a tree-smob from a Newick-format string x 
(newick-from-file x)         Create a tree-smob from a file x 
(newick-from-stockholm x)    Create a tree-smob from the tree encoded within alignment-smob x 
(newick-to-file x y)         Write tree-smob x to file y in Newick format 
(newick-ancestor-list x)     List of all ancestors in the tree-smob x 
(newick-leaf-list x)         List of all leaves in the tree-smob x 
(newick-branch-list x)       List of all branches in the tree-smob x 
(newick-unpack x)            Converts a tree-smob x into a Scheme data structure 

B.2 Functions for working with alignments 

Scheme function              Effect 

(stockholm-from-string x)    Create an alignment-smob from a Stockholm-format string x 
(stockholm-from-file x)      Create an alignment-smob from a Stockholm-format file x 
(stockholm-to-file x y)      Write alignment-smob x to Stockholm-format alignment file y 
(stockholm-column-count x)   Return the number of columns in alignment-smob x 
(stockholm-unpack x)         Converts an alignment-smob x into a Scheme data structure 

B.3 Functions for working with grammars 

Scheme function                                Effect 

(xrate-validate-grammar x)                     Validate the syntax of XRate grammar x 
(xrate-validate-grammar-with-alignment x y)    Validate the syntax of XRate grammar x, using alignment-smob y to expand macro constructs 
(xrate-estimate-tree x y)                      Use XRate grammar y to estimate a tree for alignment-smob x 
(xrate-annotate-alignment x y)                 Use XRate grammar y to annotate alignment-smob x 
(xrate-train-grammar x y)                      Train XRate grammar y on the list of alignment-smobs x 

B.4 Miscellaneous functions 

Scheme function                            Effect 

(dart-log x)                               Logging directive; equivalent to "-log x" at the command line 
(discrete-gamma-medians alpha beta K)      Returns the median rates of K equal-probability bins of the gamma distribution 
(discrete-gamma-means alpha beta K)        Returns the mean rates of K equal-probability bins of the gamma distribution [37] 
(ln-gamma k)                               Calculates the gamma function, $\Gamma(k) = \int_0^\infty e^{-x} x^{k-1} \, dx$ 
(gamma-density x alpha beta)               Calculates the gamma probability density, $\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ 
(incomplete-gamma x alpha beta)            Calculates the incomplete gamma function, i.e. the integral of the gamma density up to x 
(incomplete-gamma-inverse p alpha beta)    Calculates the inverse of the incomplete gamma function 



